Fundamentals of Statistical Testing

Lecture 02

Martina Sladekova

A reminder image so that I don't forget to record the lecture on Zoom. Again.

PSAs

Register your Kahoot username:

https://canvas.sussex.ac.uk/courses/27531/quizzes

Make sure you can see all the channels on Discord:

Analysing Data Roadmap

Today

  • Speaking stats

  • Distributions

    • More about the normal distribution
  • Sampling

    • Sampling distribution

    • Central Limit Theorem

Speaking Stats

Speaking Stats

Learning to think in statistical terms is a skill just like:

  • Drawing/art/music

  • Weightlifting

  • Speaking a new language

You don’t need “innate talent” for any of these!

You do need (1) patience, (2) lots of practice, and (3) to take things step by step

Speaking Stats


Skill | Language | Statistics | Year
Beginner | Learn vocabulary, basic sentences and grammar | Learn terminology, fundamental concepts | 1
Intermediate | Extend to more situations, how to deal with irregular forms | Extend to more types of tests/data, how to deal with bias | 2
Advanced | Create own sentences, have conversations | Create own study, apply to own data | 3

Speaking Stats

Treat stats (and R!) like you would learning a language

The core of generalising to new situations is grammar 😱

  • Generally: the rules for how to create new combinations in new situations

  • Not everyone’s favourite thing! But essential for fluency

Today’s lecture focuses on the “grammar” of statistics

  • It’s hard work, but essential for everything that follows

A two-panel meme. First panel: Screenshot from 'The Shining' of Wendy screaming as an axe comes through the door, the word 'Me' above her head. Second panel: The gap in the door with the Duolingo owl looking through, with the caption 'LEARN STATISTICS' at the bottom.

Recap and Distributions

Quick recap: Means and SDs

See PaaS lectures 6 and 7 for thorough revision!

Mean

The sum of all the numbers in a set, divided by the number of numbers.

Example: The mean of 1, 4, 6, and 3 is \(\frac{1 + 4 + 6 + 3}{4}\) = 3.5

Standard Deviation (SD)

A measure of the spread of data around the mean, on average

Calculated as the square root of the average squared difference from the mean (for a sample, we divide by n - 1 rather than n):

Example: The SD of 1, 4, 6, and 3 is

\[ \sqrt{\frac{(1-3.5)^2+(4-3.5)^2+(6-3.5)^2+(3-3.5)^2}{4-1}} = 2.08 \]
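
A quick way to check both worked examples in R, as a minimal sketch. The names `scores`, `mean_by_hand` and `sd_by_hand` are made up for illustration; `mean()` and `sd()` are base R functions, and `sd()` divides by n - 1.

```r
# The four example scores (the name `scores` is just for illustration)
scores <- c(1, 4, 6, 3)

# Mean: sum of the values divided by how many values there are
mean_by_hand <- sum(scores) / length(scores)

# SD: square root of the summed squared deviations divided by n - 1
sd_by_hand <- sqrt(sum((scores - mean_by_hand)^2) / (length(scores) - 1))

# The built-in functions should agree with the hand calculations
mean(scores)
sd(scores)
```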

It’s all Greek to me!

Symbols

Greek is for populations, Latin is for samples, hat is for population estimates

Meaning | Mean | SD
Population value | \(\mu\) | \(\sigma\)
Sample value | \(\bar{x}\) | \(s\)
Population estimate | \(\hat{\mu}\) | \(\hat{\sigma}\)

Distributions

Distribution

Numerically speaking, the number of observations for each value of a variable

  • Shows us which values occur more often and which less often
  • The shape formed by the bars of a bar chart/histogram

Distribution of Categorical Data

Distribution of Continuous Data

Known Distributions

  • Some shapes are “algebraically tractable”, i.e., there is a maths formula to draw the line
  • We can use them for statistics because they have particular known properties

A dicey example

  • 6-sided die (unbiased)

  • Each value is equally likely (e.g. getting a 6 is just as likely as getting a 3)

  • Roll it 10 times, 50 times, and 1000 times - the more times we roll it, the more the distribution of rolls resembles the shape of the probability function
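
The dice example can be tried out in R. This is only a sketch: the helper `roll_proportions()` and the seed are invented for illustration, and your exact numbers will differ with a different seed.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Roll a fair 6-sided die n times and return the proportion of each face
roll_proportions <- function(n) {
  rolls <- sample(x = 1:6, size = n, replace = TRUE)
  # factor() keeps all six faces in the count even if one never comes up
  counts <- table(factor(rolls, levels = 1:6))
  as.numeric(counts) / n
}

props_10   <- roll_proportions(10)
props_50   <- roll_proportions(50)
props_1000 <- roll_proportions(1000)

# One row per number of rolls, one column per face of the die
all_props <- rbind(props_10, props_50, props_1000)
colnames(all_props) <- 1:6

# With more rolls, each proportion should drift towards 1/6 (about 0.167)
round(all_props, 3)
```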

The normal distribution

  • Also called the Gaussian distribution, the “bell curve”

  • The one you need to understand

  • Continuous, unimodal, symmetrical, and bell-shaped

    • Not every symmetrical bell-shaped distribution is normal (e.g. t)
  • It’s also about the proportions

    • The normal distribution has fixed proportions
    • A function of two parameters: \(\mu\) (mean) and \(\sigma\) (SD)

Area below the normal curve

  • No matter the particular shape of the given normal distribution, the proportions with respect to SD are the same

Proportions of a Normal Distribution

  • ∼68% of the area below the curve is within ±1 SD from the mean
  • 95% of the area below the curve is within ±1.96 SD from the mean
  • 99% of the area below the curve is within ±2.58 SD from the mean
  • These proportions make a continuous, unimodal, symmetrical, and bell-shaped distribution “normal”
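
These proportions can be checked with `pnorm()`, which gives the area below the curve up to a given value. A minimal sketch: the helper `area_within()` and the values `any_mean` and `any_sd` are made up for illustration.

```r
# Area within +/- k SDs of the mean in a standard normal distribution
area_within <- function(k) pnorm(k) - pnorm(-k)

area_within(1)     # roughly 0.68
area_within(1.96)  # roughly 0.95
area_within(2.58)  # roughly 0.99

# The proportions do not depend on the particular mean and SD.
# Arbitrary illustrative values:
any_mean <- 100
any_sd <- 15
within_1_sd <- pnorm(any_mean + any_sd, mean = any_mean, sd = any_sd) -
  pnorm(any_mean - any_sd, mean = any_mean, sd = any_sd)
within_1_sd        # same as area_within(1)
```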

Area below the normal curve

Critical Values

Proportions to Probability

  • The proportions are always the same in a normal distribution
  • If we know that a particular quantity is normally distributed…
    • We know something about the probability of observing a particular value!

This essentially allows us to (numerically) quantify whether something is “unusual” or “surprising” given a particular baseline

What’s “unusual”?

  • The average adult goes to 127 social events per year 👀
  • The standard deviation is 40 - this is the average difference between each individual’s number of social events and the population value of 127
  • This means that:
    • 68% of people attend between 87 and 167 social events per year (127 \(\pm\) 40)
    • 95% of people attend between 48.6 and 205.4 social events per year (127 \(\pm\) 1.96 \(\times\) 40)
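
The two ranges above are simple arithmetic on the mean and SD. Here is a sketch in R, assuming the number of events is normally distributed; the variable names are made up for illustration.

```r
pop_mean <- 127
pop_sd <- 40

# Middle ~68%: mean plus/minus 1 SD
range_68 <- pop_mean + c(-1, 1) * pop_sd
range_68   # 87 167

# Middle 95%: mean plus/minus 1.96 SDs
range_95 <- pop_mean + c(-1.96, 1.96) * pop_sd
range_95   # 48.6 205.4

# qnorm() gives the same 95% boundaries without rounding 1.96 by hand
range_95_exact <- qnorm(c(0.025, 0.975), mean = pop_mean, sd = pop_sd)
range_95_exact
```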

What’s “unusual”?

  • Meet Charlie
  • Charlie goes to 57 social events per year.
  • This is 70 events below the average
  • Is Charlie unusual?

What’s “unusual”?

  • How common is Charlie’s score of 57 ?

  • Shaded area: proportion of the population that attends more social events than Charlie.

  • Non-shaded area: proportion of people who attend fewer events than Charlie

Working out proportions - the long way around

  • We can convert our distribution into a standard normal distribution by standardising the scores (number of events attended)

  • In a standard normal distribution: \(\mu\) = 0, \(\sigma\) = 1

Standardisation

The process of transforming any distribution into one with a mean of 0 and SD of 1. Also known as the process of transforming variables into Z-scores. We can transform scores into Z-scores by subtracting the mean from each score and then dividing by the standard deviation.

For example, take scores 1, 4, 6 and 3. Their mean is M = 3.5, and SD = 2.08. We can work out the Z-scores as:

\(Z_1 = \frac{1-3.5}{2.08}\) = -1.20

\(Z_4 = \frac{4-3.5}{2.08}\) = 0.24

\(Z_6 = \frac{6-3.5}{2.08}\) = 1.20

\(Z_3 = \frac{3-3.5}{2.08}\) = -0.24
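
Standardising by hand gets tedious quickly, so here is a sketch of the same calculation in R. The names `scores` and `z_scores` are made up for illustration.

```r
scores <- c(1, 4, 6, 3)

# Subtract the mean from each score, then divide by the SD
z_scores <- (scores - mean(scores)) / sd(scores)
round(z_scores, 2)

# After standardising, the mean is 0 and the SD is 1
mean(z_scores)
sd(z_scores)
```

Scores below the mean (1 and 3) get negative Z-scores, and scores above the mean (4 and 6) get positive ones.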

Working out proportions - the long way around

Let’s transform Charlie’s score of 57 into a Z-score:

  • M = 127

  • SD = 40

Z = (57 - 127) / 40 = -1.75

Working out proportions - the long way around

  • Perks of standardized scores - we know exactly how probable each Z-score is

  • Optional: Look up the probability of \(Z_{Charlie}\) = -1.75 in a Z-table - e.g. https://www.z-table.com/ (or use R)

  • The probability of obtaining a Z-score of -1.75 or lower is around 0.04 - so only 4% of people go to fewer social events than Charlie (non-shaded area)

  • The area under the curve is equal to 1 - so the remaining shaded area represents 1 - 0.04 = 0.96 or 96% (people who attend more social events per year than Charlie)

Working out proportions - the quickR way

Using the Z-score and standard normal distribution:

charlie_z <- -1.75
pnorm(charlie_z, mean = 0, sd = 1, lower.tail = FALSE)
[1] 0.9599408

Using the original scores and distributional properties:

charlie_events <- 57
pnorm(charlie_events, mean = 127, sd = 40, lower.tail = FALSE)
[1] 0.9599408

Working out critical values

  • Sometimes we want to find a value in a distribution that defines a cut-off point - for example top 5%.

Critical Value

A value that cuts off a specific proportion of a distribution

  • How many events would Charlie need to go to, if he wanted to be among the top 5% of social-event-goers on the planet?

Working out critical values

  • We could work backwards through Z-scores: find a Z-score that corresponds to the probability of 0.95, then transform the Z-score into the original score.

    • Or use R.
  • Charlie would need to attend 193 social events per year (that’s 3.7 events per week! 😱 ) if he wanted to be in the top 5% of event-goers.

  • This is 136 more events than he attends at the moment.

qnorm(p = 0.95, mean = 127, sd = 40)
[1] 192.7941
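
For comparison, here is a sketch of the "working backwards through Z-scores" route: find the Z-score that cuts off the top 5%, then undo the standardisation. The variable names are made up for illustration.

```r
pop_mean <- 127
pop_sd <- 40

# Z-score with 95% of the standard normal distribution below it
z_cutoff <- qnorm(0.95)
z_cutoff   # roughly 1.645

# Reverse the Z-score formula: score = mean + Z * SD
events_cutoff <- pop_mean + z_cutoff * pop_sd
events_cutoff
```

This should give the same value as calling `qnorm()` with the mean and SD directly.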

One more example

  • Patty attends 190 events per year. Is she in the top 10% of event goers?

  • In a normal distribution with M = 127 (social events attended per year) and SD = 40, an individual would have to attend 178 events or more to be in the top 10% of event goers.

  • Therefore, with 190 events attended per year Patty is in the top 10%.

qnorm(p = 0.9, mean = 127, sd = 40)
[1] 178.2621
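
We can also ask the question the other way around: what proportion of people attend more events than Patty? A sketch, with variable names made up for illustration:

```r
patty_events <- 190

# Cut-off for the top 10%, as above
cutoff_top_10 <- qnorm(p = 0.9, mean = 127, sd = 40)

# Is Patty at or above the cut-off?
patty_events >= cutoff_top_10

# Proportion of people who attend more events than Patty
prop_above_patty <- pnorm(patty_events, mean = 127, sd = 40, lower.tail = FALSE)
prop_above_patty
```

The proportion comes out at a little under 6%, so Patty is comfortably within the top 10% but not in the top 5%.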

So What?

  • This idea of the probability of encountering a certain value, given a specific distribution, is absolutely fundamental to everything we will do this term!

    • If you feel a bit shaky on it now, don’t worry - we’ll practice it more
  • For now, focus on:

    • Revising the logic above

    • Learning the definitions

Sampling

From Values to Samples

  • We just saw the relationship between a value and its (known) distribution

  • Next let’s talk about the relationship between a sample statistic and its sampling distribution

Sampling from distributions

  • Collecting data on a variable = randomly sampling from a distribution

Sample

A (usually randomly) selected subset of values of a particular size (e.g., 10, 50) taken from a larger pool of values, often the population

  • Many variables come from a normal distribution

  • Some variables might come from other distributions

    • Reaction times: log-normal distribution

    • Number of annual casualties due to horse kicks: Poisson distribution

    • Passes/fails on an exam: binomial distribution
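
R has a random-sampling function for each of these distributions. The sketch below is only an illustration: every parameter value (and every variable name) is invented, not taken from real data.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Reaction times: log-normal (hypothetical parameters on the log scale)
reaction_times <- rlnorm(n = 1000, meanlog = 6, sdlog = 0.3)

# Yearly counts of a rare event: Poisson (hypothetical average of 0.7 per year)
horse_kicks <- rpois(n = 20, lambda = 0.7)

# Number of passes in 10 classes of 50 students: binomial
# (hypothetical pass probability of 0.8)
exam_passes <- rbinom(n = 10, size = 50, prob = 0.8)

head(reaction_times)
horse_kicks
exam_passes
```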

Sampling more humans

  • So far, we looked at a score of one (unsociable) individual
  • In research, we typically work with scores of many individuals - we collect a sample.
  • Samples from the same population will be different from one another
  • We can ask 6 people how many events they go to. Then ask 6 different people the next day - everyone’s score will be different.
# ask 6 people about their event-going:
rnorm(n = 6, mean = 127, sd = 40) 
[1]  94.72075 134.76754 125.31043 120.08560 232.28828 117.47832
# repeat
rnorm(n = 6, mean = 127, sd = 40)
[1] 142.32235 136.06493 118.61234  96.26304  90.73451 162.95180

Sampling from distributions

  • Statistics ( \(\bar{x}\), \(s\), etc.) of two samples will be different

  • Sample statistic (e.g., \(\bar{x}\)) will likely differ from the population parameter (e.g., \(\mu\))

sample_day_1 <- rnorm(n = 50, mean = 127, sd = 40) 
sample_day_2 <- rnorm(n = 50, mean = 127, sd = 40) 

mean(sample_day_1)
[1] 133.3792
mean(sample_day_2)
[1] 126.1106

Sampling distribution

  • If we took many samples of a given size (say N = 50) from the population and each time calculated \(\bar{x}\), the means would have their own distribution

Sampling Distribution (of the Mean)

The distribution of the means of many samples of a particular size.

The distribution is (approximately) normal and centred around the true population mean, \(\mu\)

  • Every statistic has its own sampling distribution (not all normal though!)

Sampling distribution

many_sample_means <- replicate(100000, mean(rnorm(50, mean = 127, sd = 40)))
mean(many_sample_means)
[1] 127.0259
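
The same simulation can be repeated for different sample sizes to see how N affects the sampling distribution. This is a sketch: the helper `simulate_means()`, the chosen sample sizes, and the seed are made up for illustration.

```r
set.seed(4)  # arbitrary seed so the sketch is reproducible

# Draw `reps` samples of size n and return the mean of each sample
simulate_means <- function(n, reps = 10000) {
  replicate(reps, mean(rnorm(n, mean = 127, sd = 40)))
}

means_by_n <- list(
  n5   = simulate_means(5),
  n50  = simulate_means(50),
  n500 = simulate_means(500)
)

# All three sampling distributions are centred close to 127...
sapply(means_by_n, mean)

# ...but the sample means vary less and less as N grows
sapply(means_by_n, sd)
```

With larger samples, any single sample mean is more likely to land close to the population mean.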

The Central Limit Theorem

  • You’ve just seen the Central Limit Theorem in action.

  • As N gets larger, the sampling distribution of \(\bar{x}\) tends towards a normal distribution with mean = \(\mu\)

  • True no matter the shape of the population distribution!

    • “Central” as in “really important” because, well, it is!

The Central Limit Theorem

  • Take a sample

  • Compute the mean

  • Put it on the plot below

  • Repeat
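
The four steps above can be sketched in R. To make the point about non-normal populations, this sketch samples from an exponential distribution, which is strongly skewed. That choice, the sample size of 50, and the variable names are assumptions made for illustration.

```r
set.seed(5)  # arbitrary seed so the sketch is reproducible

# Take a sample of 50 from a skewed population (exponential, mean = 1),
# compute its mean, and repeat 10,000 times
n_per_sample <- 50
sample_means <- replicate(10000, mean(rexp(n_per_sample, rate = 1)))

# Raw values are skewed: the mean sits well above the median
raw_values <- rexp(10000, rate = 1)
mean(raw_values) - median(raw_values)

# Sample means are close to symmetric: mean and median nearly coincide
mean(sample_means) - median(sample_means)

# "Put it on the plot": histogram of the sample means
hist(sample_means, breaks = 40,
     main = "Means of samples from a skewed population",
     xlab = "Sample mean")
```

The histogram should look roughly bell-shaped and centred near 1, even though the individual values come from a very lopsided distribution.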

One more peek at the dice

  • We know that dice rolls are uniformly distributed - each number is equally likely

  • What if we calculate an average roll?

One more peek at the dice

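library(ggplot2)

# 10,000 times: roll a fair die 50 times and take the mean of those rolls
dice_rolls_6 <- replicate(10000, mean(sample(x = 1:6, size = 50, replace = TRUE)))

# Histogram of the 10,000 average rolls
ggplot() + 
  geom_histogram(aes(x = dice_rolls_6), binwidth = 0.1, fill = "darkcyan", colour = "white") + 
  labs(x = "Average roll value") + 
  theme_minimal(base_size = 15)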

  • The CLT governs a lot of processes where randomness and sampling are involved.

  • This is extremely useful for research - our tests are not immediately doomed if we collect a messy sample

Take-home messages

  • There are many mathematically well-described distributions

  • Normal (Gaussian) distribution is one of them

    • Continuous, unimodal, symmetrical, bell-shaped

    • Must have the right proportions to be normal!

    • We can use these proportions to work out critical values

Take-home messages

  • Statistics of random samples differ from parameters of a population

  • As N gets bigger, sample statistics tend to get closer to the population parameters

  • Distribution of sample means (or other statistics) is the sampling distribution

  • Central Limit Theorem

    • Really important!

    • Sampling distribution of the mean tends to normal even if population distribution is not normal

  • Understanding distributions, sampling distributions and the CLT is most of what you need to understand all the stats techniques we will cover.

NEXT WEEK

  • More sampling distributions

  • Quantifying uncertainty with standard errors and confidence intervals